Replication information for "Transfer Learning for Topic Labeling: Analysis of the UK House of Commons Speeches 1935--2014"
Alex Herzog, April 12, 2021

The code files are split into five parts:

1.x: These scripts compute dynamic topics from the full corpus, using the approach presented in Blei, David M., and John D. Lafferty, 2006. "Dynamic topic models." Proceedings of the 23rd International Conference on Machine Learning. Topics are estimated using the code provided here: https://github.com/blei-lab/dtm

2.x: These scripts calculate semantic coherence and FREX to find the optimal number of topics, using the approach discussed in Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder‐Luis, J., Gadarian, S.K., Albertson, B. and Rand, D.G., 2014. "Structural topic models for open‐ended survey responses." American Journal of Political Science, 58(4), pp.1064-1082.

3.x: These scripts transfer topic labels from the Comparative Agendas Project codebooks to the estimated topics using the approach discussed in the paper.

4.x: These scripts reproduce the tables and figures presented in the paper and supplementary material.

5: This script labels topics using neural embeddings (the robustness study discussed in the paper), using the approach presented in Bhatia, S., J. Han Lau and T. Baldwin. 2016. "Automatic Labelling of Topics with Neural Embeddings." Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics. This script uses code and the pipeline described here: https://github.com/sb1992/NETL-Automatic-Topic-Labelling-
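
As an illustration of the semantic coherence measure used in the 2.x scripts, the following is a minimal Python sketch. It assumes the co-occurrence definition of Mimno et al. (2011), on which the semantic coherence discussed in Roberts et al. (2014) is based. It is not taken from the replication scripts; the function name, the toy documents, and the word lists are hypothetical. FREX is not covered here.

```python
import math


def semantic_coherence(top_words, documents):
    """Co-occurrence-based coherence for a single topic.

    top_words: topic words ordered from most to least probable
    documents: iterable of tokenised documents (lists of words)
    Assumption: score = sum over word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                # Skip words that never occur, to avoid division by zero
                continue
            joint = doc_freq(top_words[i], top_words[j])
            score += math.log((joint + 1) / denom)
    return score


# Hypothetical toy corpus, for illustration only
docs = [
    ["tax", "budget", "income"],
    ["tax", "budget"],
    ["school", "teacher"],
    ["tax", "school"],
]
coherent = semantic_coherence(["tax", "budget"], docs)
incoherent = semantic_coherence(["tax", "teacher"], docs)
```

Words that frequently appear in the same documents yield a score closer to zero, so `coherent` is larger than `incoherent` in this toy example. Comparing such scores across models with different numbers of topics is one way to inform the choice of topic number.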


The following data are provided:
- Hansard_speeches_1935-2014.tar.xz: The complete Hansard record from 1935-2014 parsed into a tab-separated file
- MySQL_stopword_list.txt: The stopword list used in '1.1-generate_DTM_file_structure.R' when processing the Hansard records before applying the DTM code
- dfm.RData: The document-feature matrix generated in '1.1-generate_DTM_file_structure.R'
- meta_data.dat: Document-specific data associated with the Hansard records, generated in '1.1-generate_DTM_file_structure.R'
- word_ids.dat: The word ids generated in '1.1-generate_DTM_file_structure.R'
- data-mult.dat: Input file for the DTM code, generated in '1.1-generate_DTM_file_structure.R'
- data-seq.dat: Input file for the DTM code, generated in '1.1-generate_DTM_file_structure.R'
- 22-topics_100_iterations_dtm_output.tar.xz: The DTM output estimated with 22 topics (output for other numbers of topics is omitted but can be generated using '1.2-runDTM.sh')
- 22-topics_100_iterations_dtm_results.tar.xz: The processed results for the DTM output estimated with 22 topics (results for other numbers of topics are omitted but can be generated using '1.2-runDTM.sh')
- policy agendas codebook.tar.xz: The original Comparative Agendas Project codebooks
- policy_agenda_word_lists.txt: The word lists extracted from the original Comparative Agendas Project codebooks that were used for the topic matching
- policy_agenda_topic_labels.txt: The topic labels (i.e., the category labels) from the original Comparative Agendas Project codebooks that were used for the topic matching
- best_model_matches_DTM_model_22_topics_top_15_words.csv: The best matching results for the DTM model with 22 topics, based on the top 15 words per topic
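
For readers unfamiliar with the DTM input files listed above (data-mult.dat and data-seq.dat), the following Python sketch shows the layout as described in the README of https://github.com/blei-lab/dtm: the mult file has one line per document of the form "unique_word_count index1:count1 index2:count2 ...", and the seq file gives the number of time slices followed by the number of documents in each slice. This is not the code in '1.1-generate_DTM_file_structure.R'; the function name, the zero-based word ids, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def dtm_input_lines(docs_by_slice, vocab):
    """Build the lines of the two DTM input files.

    docs_by_slice: list of time slices, each a list of tokenised documents,
                   ordered chronologically
    vocab: dict mapping each word to an integer id (assumed zero-based)
    Returns (mult_lines, seq_lines) as lists of strings.
    """
    mult_lines = []
    # First line of the seq file: number of time slices
    seq_lines = [str(len(docs_by_slice))]
    for docs in docs_by_slice:
        # One line per slice: number of documents in that slice
        seq_lines.append(str(len(docs)))
        for tokens in docs:
            # Count word ids, ignoring words outside the vocabulary
            counts = Counter(vocab[t] for t in tokens if t in vocab)
            pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
            mult_lines.append(f"{len(counts)} {pairs}".rstrip())
    return mult_lines, seq_lines


# Hypothetical toy data: two time slices with one and two documents
vocab = {"tax": 0, "budget": 1, "school": 2}
slices = [
    [["tax", "tax", "budget"]],
    [["school"], ["budget", "school", "school"]],
]
mult_lines, seq_lines = dtm_input_lines(slices, vocab)
```

In this sketch the documents must be written in the same chronological order in which the slices are counted, since the seq file only records how many consecutive lines of the mult file belong to each time slice.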



